Add testing DSS with Canonical Kubernetes (New) #1793
Conversation
We will need helm when installing Canonical K8s to enable the NVIDIA GPU operator in it. Canonical K8s (and helm) will only be installed if the explicit argument for the channel to use is provided. Otherwise, the old default behaviour of installing microk8s is maintained.
We use helm to add the relevant chart from NVIDIA and install it. We re-use the existing script to verify the rollout too.
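As a rough illustration, the helm invocations could look like the sketch below. The repository URL, chart name (`nvidia/gpu-operator`), release name, and namespace here are assumptions based on NVIDIA's public Helm chart, not necessarily what the provider script uses.

```python
import subprocess

# Assumption: NVIDIA's public Helm repository; the provider may pin a
# different repo, chart version, or namespace.
NVIDIA_HELM_REPO = "https://helm.ngc.nvidia.com/nvidia"

def helm_gpu_operator_cmds(namespace="gpu-operator-resources"):
    """Build the helm invocations used to enable the NVIDIA GPU operator."""
    return [
        ["helm", "repo", "add", "nvidia", NVIDIA_HELM_REPO],
        ["helm", "repo", "update"],
        ["helm", "install", "gpu-operator", "nvidia/gpu-operator",
         "--namespace", namespace, "--create-namespace", "--wait"],
    ]

def enable_gpu_operator():
    # Run each helm step, failing fast on the first error.
    for cmd in helm_gpu_operator_cmds():
        subprocess.run(cmd, check=True)
```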
This job needs to run after either of the two jobs above it that enable the NVIDIA GPU in the k8s cluster succeeds (one is for microk8s, the other for Canonical K8s). We can't list those jobs in `depends`, because then both of them would have to succeed, which is impossible: only one of microk8s or Canonical K8s will be available. The trick we use is to `depends` on `dss/initialize`, which must succeed for the whole test plan to run anyway, and to additionally require that an NVIDIA GPU is present. This mirrors the `depends` of the two jobs for microk8s and Canonical K8s. We then have to be careful that this job is placed in the test plan to run ONLY after those two jobs. The difference is that this job will not be skipped if either of the two GPU-enabling jobs fails.
We now have an addition to the `install-deps` script. It also marks the point from which we started supporting Canonical K8s.
The "worker" daemonset that was being verified may have a version number in its name, which we cannot predict.
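A minimal sketch of matching such a name without predicting the version, assuming the daemonset name ends in `-worker` plus an optional unpredictable suffix (the naming pattern here is an assumption for illustration):

```python
import re

# Assumption: names look like "...-worker" or "...-worker-<suffix>",
# where the suffix (version/hash) cannot be predicted ahead of time.
WORKER_RE = re.compile(r"-worker(-[\w.]+)?$")

def find_worker_daemonset(names):
    """Return the first daemonset whose name matches the worker pattern,
    or None if no candidate is found."""
    for name in names:
        if WORKER_RE.search(name):
            return name
    return None
```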
Codecov Report ✅ All modified and coverable lines are covered by tests. Additional details and impacted files:
@@ Coverage Diff @@
## main #1793 +/- ##
==========================================
+ Coverage 50.44% 50.67% +0.23%
==========================================
Files 382 384 +2
Lines 41026 41219 +193
Branches 6890 6890
==========================================
+ Hits 20696 20889 +193
Misses 19585 19585
Partials 745 745
Flags with carried forward coverage won't be shown.
The NVIDIA GPU operator can be enabled in both microk8s and Canonical K8s using helm, so we remove all the ugly parts that tried to handle whether microk8s or Canonical K8s was installed, and just use the unified helm-based approach. Helm now becomes a hard requirement.
As discussed in:
We do not need to customize `slots_per_gpu` today, so let's keep it very simple and apply the Kustomize configurations directly.
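A minimal sketch of applying a Kustomize directory directly; the directory name in the test/usage is hypothetical, and the helper simply wraps `kubectl apply -k`:

```python
import subprocess

def kustomize_apply_cmd(directory):
    """Build the `kubectl apply -k` command for a Kustomize directory."""
    return ["kubectl", "apply", "-k", directory]

def apply_kustomize(directory):
    # check=True surfaces kubectl failures to the caller instead of
    # silently continuing.
    subprocess.run(kustomize_apply_cmd(directory), check=True)
```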
KUBECONFIG is actually used by kubectl and other tools, so this is not the right use of it.
@fernando79513 ... I was able to compress setting up K8s for NVIDIA and Intel GPUs and moved it into a Python script (see this commit) ... Is this what you were expecting? Personally, now that the setup is nicely compressed, I don't see much value in wrapping it in a Python script. Let me know if you still prefer the Python script, and what sort of unit tests you believe it requires.
Modified from copy of checkbox-ce-oem native build
Currently this only requires passing environment variables that point to the containerd config and socket paths for microk8s. This customisation is not required on Canonical K8s.
The original job is renamed and should continue to work. We need a slightly different job for installing the gpu operator on microk8s.
This reverts commit 02138ef.
We are going to detect it.
The labels take some time to propagate.
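Waiting for label propagation can be handled with a small poll-until-true helper; a sketch (the timeout and interval defaults are illustrative, not the provider's actual values):

```python
import time

def wait_for(check, timeout=60.0, interval=5.0):
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Useful when node labels applied by the operator take some time to
    propagate before they can be observed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```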
If there's a TimeoutError, `microk8s status` was still executing, so microk8s is there ... even though it does not tell us whether microk8s is in use or not. Anyway, re-raise the error instead of deciding that there is no microk8s.
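A sketch of this behaviour using `subprocess.TimeoutExpired` (the provider's code may surface it as a `TimeoutError`, and the exact `microk8s status` invocation and timeout are assumptions):

```python
import subprocess

def run_with_timeout(cmd, timeout):
    """Run a command; if the binary is found but still executing when
    the timeout expires, it clearly exists, yet we learned nothing
    about its state, so the timeout is propagated rather than being
    treated as 'not installed'."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        # microk8s is there, but we cannot tell whether it is in use:
        # re-raise instead of guessing.
        raise

def microk8s_status(timeout=30):
    return run_with_timeout(["microk8s", "status"], timeout)
```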
The validator container may not have been created even after the daemonset is rolled out (for some unknown reason), hence we wait before checking the logs. And since checking the logs waits for the validations to succeed, this job may take considerably longer.
It is FileNotFoundError that is raised when microk8s is not installed, so we are not going to try to catch all other CalledProcessErrors, for now.
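A sketch of that detection, with the probe command parameterised for testability (`microk8s version` as the probe is an assumption; the provider may invoke a different subcommand):

```python
import subprocess

def binary_installed(cmd=("microk8s", "version")):
    """Return False only when the executable itself is missing
    (FileNotFoundError). Other failures, e.g. CalledProcessError from
    check=True, are deliberately not swallowed."""
    try:
        subprocess.run(list(cmd), check=True, capture_output=True)
    except FileNotFoundError:
        return False
    return True
```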
We need snapd > 2.59 for SNAP_UID.
@fernando79513 ... Sorry for the delay, but I got stuck on some weird behaviour of Helm-installing the NVIDIA GPU operator (see my comment in the script). Furthermore, I needed to add some special handling for installing the NVIDIA operator on
Closing this without merging. To be picked up again as part of CHECKBOX-1898.
Description
`dss` now supports running on Canonical Kubernetes instead of `microk8s`. This support is currently available in channels for version 1.1. This PR adds support for testing `dss` on Canonical Kubernetes.

Updates to the provider
- The `install-deps` script now accepts an argument to install Canonical Kubernetes instead of `microk8s`. It still installs `microk8s` by default.
- `install-deps` now also installs the `helm` snap, which will be used to enable NVIDIA GPU support in both Kubernetes variants.
- A new script, `k8s_gpu_setup.py`, can be used to set up GPUs from both Intel and NVIDIA on both `microk8s` and Canonical Kubernetes.
- `kubectl apply` with `-k` is used directly. Support for setting a specific number of slots-per-GPU has been removed as it is not relevant to testing DSS at the moment.
- `helm` is used for enabling NVIDIA GPU support in the Kubernetes cluster, roughly following this guide.
- `microk8s` is detected by the script, and the relevant customisation for `containerd` is done automatically.
- The provider snap's minor version has been bumped to indicate since when support for Canonical Kubernetes was added.
Updates to the GitHub Workflows
- `checkbox-dss-build.yaml` to build the `checkbox-dss` snap.

Resolved issues
Documentation
Updated the README for this provider. No changes to main Checkbox documentation.
Tests
One of the machines has been failing to provision today, but the tests have passed on the other two machines using the updated workflow.